MSt Healthcare Data Science

Authentication of practice

I confirm that I have fully read and understood the assignment brief for this module. Y

Details

Name: Chung Yan Surname: Yu
Submission Date 19-10-25
Word Count: whole assignment including codes
Word Count: main body excluding abstract, references and supplementary materials

Permission to share your assignment

I do give permission to share my assignment with future MSt participants.

University statement of originality

This assignment is the result of my own work and includes nothing which is the outcome of work done in collaboration except as declared in the Preface and specified in the text. It is not substantially the same as any that I have previously submitted for a degree or diploma or other qualification at the University of Cambridge or any other university or similar institution, or that is being concurrently submitted, except as declared in the Preface and specified in the text. I further state that no substantial part of my Portfolio has already been submitted, nor is being concurrently submitted for any such degree, diploma or other qualification at the University of Cambridge or any other university or similar institution except as declared in the Preface and specified in the text.

I confirm the statement of originality as above Y

Questions for reflection

Self-assessment is an important aspect of feedback literacy, which is, in turn, key to the development of expertise. As you proceed through the MSt Healthcare data science programme, we hope that you will make use of the following prompts to assess your own work on assignments. Specific assignment briefs will likely indicate which of these to address for which assessments, but, in general, we expect you to respond to one or two for each assignment on your course.

For each of the questions, do not spend too long answering – keep it brief. For each question you answer, limit yourself to no more than three items. And please remember, this is optional and developmental: these cover sheets are designed to create space for self-assessment and feedback dialogue, rather than additional assignment workload.

  1. Which aspects of this assignment are you most uncertain about and/or would most like to receive feedback on?
  2. What elements are you left pondering after this assignment that you would like to discuss further?
  3. How have you incorporated feedback from peers and tutors into this assignment?
  4. How, and to what extent, have you been able to incorporate feedback on previous course work into this assignment?
  5. Using the wording in the rubric, how would you describe the quality of the different aspects of your work?

Declaration of the use of generative AI

Which permitted use of generative AI are you acknowledging? Semantic search of literature and notes, outline creation, code debugging, output formatting and generation, informed feedback
Which generative AI tool did you use (name and version)? Claude Code v2.0.3x (the version likely changed over usage), Claude Desktop v1.0.3x with PubMed Connector, M365 CoPilot
What did you use the tool for? Searching for relevant journal publications, sanity check on coding strategies, collating notes in my Obsidian vault, debugging code by parsing error messages
How have you used or changed the generative AI’s output? E.g. my collated notes always stay in point form so that I write out the paragraphs in my own words. Feedback provided by the GenAI models is weighed and assessed before being acted on. Generated code is checked: for example, the tool tried to run Levene's test on a model fitted with residuals as the response, and this was corrected back to the relevant response variable.

Abstract

Diffuse large B cell lymphoma (DLBCL) is the most common non-Hodgkin lymphoma, with prognosis influenced by factors ranging from demographic features (age, gender) and integrated scoring systems such as the International Prognostic Index (IPI) to genetic mutations that alter gene expression levels. The IPI was later revised as the National Comprehensive Cancer Network International Prognostic Index (NCCN-IPI), in which age changed from a binary category (cut-off at 60) to a quaternary category (≤40, 41-60, 61-75, >75). The oncogenes MYC, BCL2 and BCL6 emerged with strong associations with DLBCL, and further studies have identified over 150 genetic drivers of this disease. Yet how demographic factors interact with this expanding catalogue of genetic markers, and how they affect the markers' power to predict clinical outcomes, remains understudied. This study analysed clinical and genomic data from 1001 DLBCL patients to investigate whether age and gender modify the prognostic associations between gene expression markers and complete therapy response. Age was found to have a significant effect on the expression levels of 10 genes and on the prognostic power of 1 gene for therapy response. The 10 genes with different expression levels across NCCN-IPI age groups were BCL2 (p = 0.0002), BLNK (p = 0.003), MYBL1 (p = 0.0096), SH3BP5 (p = 0.0156), ITPKB (p = 0.0203), CCND2 (p = 0.0203), PTPN1 (p = 0.0314), BMF (p = 0.0339), FUT8 (p = 0.0367) and LMO2 (p = 0.0377) (all p-values are FDR-adjusted). CCND2 also showed a significant interaction with age in predicting complete response (likelihood-ratio test, adjusted p-value = 0.005523): the probability of complete response increased with CCND2 expression in the older age groups (61-75, >75) but decreased with CCND2 expression in the younger age groups (≤40, 41-60). This highlights the need to consider the prognostic value of genomic drivers and age in concert rather than individually.

Introduction

Diffuse large B cell lymphoma (DLBCL) is the most common type of aggressive non-Hodgkin’s lymphoma (NHL) (The Non-Hodgkin’s Lymphoma Classification Project 1997; Wang 2023), with a wide range of factors contributing to prognosis. To provide a comparable and more comprehensive pre-treatment prognosis model, the International Prognostic Index (IPI) emerged as a standardised model for DLBCL and other aggressive NHLs (Shipp et al. 1993), combining 5 factors: age, Ann Arbor stage classification, ECOG Performance Status, serum lactate dehydrogenase (LDH) level and the number of extranodal sites of disease. By pooling these factors, the IPI outperformed simple disease stage classification models such as Ann Arbor. Zhou et al. (2014) went on to improve on the IPI, creating the National Comprehensive Cancer Network International Prognostic Index (NCCN-IPI). A key innovation of the NCCN-IPI was the use of cubic splines to model the continuous factors of age and LDH level as higher-resolution categorical factors, changing them from binary factors to quaternary and ternary factors respectively. As prognosis models improved, so did treatment. The first-generation chemotherapy of cyclophosphamide, doxorubicin, vincristine, and prednisone (CHOP) had a cure rate of 30-35% (Fisher et al. 1993). With the addition of rituximab to the regimen (R-CHOP) and other advances, DLBCL became curable for more than 60% of patients (Johnson et al. 2012).

While age is incorporated into the IPI (and the subsequent NCCN-IPI), other demographic factors such as gender are not, despite established evidence for their prognostic involvement. Males demonstrate higher hazard ratios (Carella et al. 2013), and gender-associated pharmacokinetic differences in rituximab lead to worse treatment responses in males (Habermann 2014). These demographic effects raise questions about whether risk factors perform uniformly across patient subgroups. Beyond demographic and clinical factors, a series of genetic alterations have emerged as strong prognostic markers: BCL2 (Gascoyne et al. 1997), BCL6 (Lo Coco et al. 1994) and MYC (Chenevix-Trench et al. 1986). These changes are observed at both the genotypic and phenotypic levels, as translocation mutations and protein overexpression respectively (Petrich et al. 2014). Furthermore, patients with varying combinations of these alterations have shown worse responses to R-CHOP treatment, highlighting the clinical relevance of genomic profiling. Some of these genetic markers, such as BCL6 (Klapper et al. 2012) and MYC (Kurz et al. 2025), have shown an association with patient age, adding another layer of interaction that could influence prognosis. Klapper et al. (2012) further demonstrated that IRF4 translocations are associated with age and that, along with BCL6 and other genetic drivers, they collectively lose prognostic significance when age is incorporated in multivariate models. This indicates that demographic factors may not simply add to genomic risk, but could modify how these genomic drivers influence clinical outcomes.

With the advent of machine learning methods came attempts to construct models based on clinical and genomic data. Reddy et al. (2017a) performed whole exome and transcriptome sequencing in 1001 DLBCL patients, identifying 150 genetic drivers including BCL2, BCL6, and MYC alongside numerous newly characterised candidates. By combining genotypic alterations, gene expression data and cell-of-origin classification, they developed a genomic risk model that outperformed the IPI. However, the potential for demographic factors to modify these genomic associations remains understudied. Given that age and gender both demonstrate prognostic significance and influence treatment response, it remains unknown whether genomic risk markers perform uniformly across demographic subgroups. If these demographic factors modify how gene expression levels impact treatment response, such that the same gene expression levels in different gender or age groups lead to different prognostic outcomes, it would be crucial that these differences are identified and applied in future models to prevent inaccurate prognoses.

This study addresses this gap by investigating whether age and gender modify the prognostic associations between gene expression markers and therapy response in DLBCL. The associations of the demographic factors (age and gender) with complete therapy response and with the expression levels of 21 genes were considered first. Logistic regression models of therapy response were then fitted using the expression levels of the 21 genes, and the interactions of the demographic factors with these expression levels were examined. Exploring how demographic factors interact with known and potential genomic drivers of DLBCL could reveal age- or gender-dependent tumour biology that informs therapeutic approaches.

Here the same datasets (Reddy et al. 2017b) from the genomic risk model study by Reddy et al. (2017a) are used, specifically the clinical information dataset from Supplementary Table S1 and the gene expression dataset from Supplementary Table S2. The clinical information dataset comprises 1001 DLBCL patients with 35 variables. Patients are recorded under anonymised IDs, and the variables of interest here are gender, response to initial therapy, age at diagnosis, and the expression levels of the MYC, BCL2 and BCL6 genes, given as log2-transformed RNA-sequencing fragments per kilobase of transcript per million mapped reads (FPKM) values. These columns were selected for further analysis not only for their relevance but also for their relative completeness. The gene expression dataset comprises the expression levels (also log2 transformed) of 19 genes for 775 patients recorded under the same anonymised IDs; one of these genes, BCL6, is also recorded in the clinical information dataset.

Methods

The datasets are available under Elsevier’s open access license, as shown on the ScienceDirect webpage for Reddy et al. (2017a). For their work, the authors obtained anonymised patient clinical information and tumours, which were processed in line with a protocol approved by the Institutional Review Board at Duke University. Two datasets, Table S1 and Table S2, were downloaded as Excel files directly from their links on that ScienceDirect page using the download.file function from R’s utils package (R Core Team 2025) and read with the readxl package (Wickham and Bryan 2025). The relevant sheets were selected, cleaned and exported as .csv files; this pre-processing is documented in the 01-data-preprocessing.R script. The processed .csv files were then uploaded to this project’s GitHub repo as mmc1-ClinicalInformation.csv (raw) and mmc2-GeneExpression.csv (raw). This RMarkdown document contains only the code for the analyses and plots shown; for the complete analysis, including assumption testing, post-hoc analysis and tests with non-significant results, please refer to the supplementary-materials folder of this repository. For redundancy, this document supports loading the datasets from 3 sources: the original ScienceDirect links, the processed .csv files on GitHub, and the local processed .csv files when this repository is downloaded. For this knitted HTML document, both datasets were loaded directly from ScienceDirect’s supplementary materials: S1 Table (clinical data) and S2 Table (gene expression data).

The 2 datasets were then joined on their common patient IDs, and 2 new columns were added. The first, age_group_nccn, was obtained by transforming the age at diagnosis variable into 4 categorical groups according to the cut-offs defined in the NCCN-IPI of Zhou et al. (2014): ≤40, 41-60, 61-75 and >75. The second, complete_response, was derived from the response_therapy column by re-classifying “Partial response” and “No response” as “Incomplete response” while keeping “Complete response” unchanged. This prepared the dataset for binomial logistic regression. As both datasets contained gene expression levels for BCL6, the two columns were compared, confirmed to be identical, and one was discarded. The resulting table had 775 patients and 27 variables, which include the anonymised patient ID; demographic variables such as gender (binary: F/M), age (continuous, at diagnosis) and age group (quaternary, as defined by NCCN-IPI); clinical information on therapy response (ternary: Complete response, Partial response, No response); and the log2-transformed gene expression levels of 21 genes: MYC, BCL2, BCL6, ITPKB, MME, MYBL1, DENND3, NEK6, LMO2, LRMP, SH3BP5, IRF4, PIM1, ENTPD1, BLNK, CCND2, ETV6, FUT8, BMF, IL16 and PTPN1.
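The two derived columns described above can be sketched as follows. This is a minimal standalone Python illustration (the study's own scripts are in R), not the code in 02-data-processing.R. The function names, the ASCII label "<=40", the assumption that ages are recorded in whole years, and the choice to keep missing values missing are all assumptions made for this sketch.

```python
from typing import Optional


def nccn_age_group(age: Optional[float]) -> Optional[str]:
    """Map age at diagnosis to one of the four NCCN-IPI age bands."""
    if age is None:
        return None  # assumption: a missing age stays missing
    if age <= 40:
        return "<=40"
    if age <= 60:
        return "41-60"
    if age <= 75:
        return "61-75"
    return ">75"


def complete_response(response: Optional[str]) -> Optional[str]:
    """Collapse the ternary therapy response into a binary outcome."""
    if response is None:
        return None  # assumption: a missing response stays missing
    if response == "Complete response":
        return "Complete response"
    if response in ("Partial response", "No response"):
        return "Incomplete response"
    raise ValueError(f"Unexpected response category: {response!r}")


# Hypothetical example records (not taken from the study data).
for age, response in [(38, "Complete response"), (67, "Partial response")]:
    print(age, nccn_age_group(age), complete_response(response))
```

Raising an error on an unexpected category, rather than silently mapping it to "Incomplete response", keeps any spelling variants in the source column visible during pre-processing.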

Gene expression data were limited to the 21 genes available from Reddy et al. (2017b)’s datasets (Supplementary Tables S1 & S2), which fall into 2 categories. The first category is the “big 3” oncogenic drivers MYC, BCL2 and BCL6, which serve as strong prognostic markers. Moreover, MYC (Kurz et al. 2025) and BCL6 (Klapper et al. 2012) have both shown age-dependent prognostic effects, suggesting age may modify how these markers influence clinical outcomes. The second category comprises 19 cell-of-origin classifier genes (DLBCL subtypes: ABC vs GCB) (Wright et al. 2003), with BCL6 appearing in both categories. Despite their initial use for classification, members such as CCND2 and LMO2 (Lossos et al. 2004) have been shown to be strong survival predictors alongside BCL2 and BCL6. More importantly, IRF4, alongside BCL6, is among the genetic features that show age associations and contribute to a genetic complexity that loses prognostic significance when age is incorporated (Klapper et al. 2012). Both categories therefore contain genes with the potential to interact with demographic factors, making them suitable for testing whether demographic factors modify their prognostic associations.

Overall, data completeness for these 27 columns was high, with 23 columns 100% complete and the remaining 5 columns all more than 93% complete. All of this processing is performed in the 02-data-processing.R script. Characteristics of the cohort are summarised in Table 1.

Table 1. Baseline Cohort Characteristics

| Characteristic | Total (N=775) | ≤40 years | 41-60 years | 61-75 years | >75 years |
|---|---|---|---|---|---|
| Sample size | N = 775 | n = 75 | n = 249 | n = 285 | n = 135 |
| Gender | | | | | |
| Female | 338 (43.6%) | 20 (26.7%) | 91 (36.5%) | 129 (45.3%) | 77 (57%) |
| Male | 437 (56.4%) | 55 (73.3%) | 158 (63.5%) | 156 (54.7%) | 58 (43%) |
| Age at diagnosis, years | 61 ± 15.4 | 29.5 ± 8 | 52.1 ± 5.2 | 67.7 ± 4.4 | 80.4 ± 4 |
| Treatment response | | | | | |
| Complete response | 598 (82.6%) | 67 (89.3%) | 202 (83.1%) | 218 (82.3%) | 94 (77%) |
| Incomplete response | 126 (17.4%) | 8 (10.7%) | 41 (16.9%) | 47 (17.7%) | 28 (23%) |

Demographic and clinical characteristics stratified by NCCN-IPI age groups.
Values shown as n (%) for categorical variables and mean ± SD for continuous variables.
NCCN-IPI age groups represent National Comprehensive Cancer Network International Prognostic Index classifications.
Treatment response categories combine partial and no response as “Incomplete response” versus complete response.
Percentages are calculated within each age group for gender distributions and treatment responses.
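The treatment response rows of Table 1 are enough to sketch a Pearson chi-squared test of independence between age group and response. The block below is a standalone Python illustration (the study's own analysis is in R, in 05-response-analysis.R), using only the counts printed in Table 1. It assumes no continuity correction and uses the closed-form upper tail of the chi-squared distribution with 3 degrees of freedom; the function names are hypothetical.

```python
import math

# Treatment response counts by NCCN-IPI age group, copied from Table 1
# (column order: <=40, 41-60, 61-75, >75).
complete = [67, 202, 218, 94]
incomplete = [8, 41, 47, 28]


def chi_squared_independence(rows):
    """Return the Pearson chi-squared statistic and the expected counts."""
    row_totals = [sum(r) for r in rows]
    col_totals = [sum(c) for c in zip(*rows)]
    n = sum(row_totals)
    expected = [[rt * ct / n for ct in col_totals] for rt in row_totals]
    stat = sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(rows, expected)
        for o, e in zip(obs_row, exp_row)
    )
    return stat, expected


def chi2_sf_df3(x):
    """Upper-tail probability of a chi-squared variable with 3 degrees of freedom."""
    return math.erfc(math.sqrt(x / 2)) + math.sqrt(2 * x / math.pi) * math.exp(-x / 2)


stat, expected = chi_squared_independence([complete, incomplete])
p_value = chi2_sf_df3(stat)  # df = (2 - 1) * (4 - 1) = 3
min_expected = min(min(row) for row in expected)

print(f"chi-squared = {stat:.2f}, df = 3, p = {p_value:.3f}")
print(f"smallest expected count = {min_expected:.1f}")
```

The smallest expected count also gives a direct check of the "expected frequencies greater than 5" condition, and the p-value can be compared with the age-group result reported in the Results.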

First, variation in the expression levels of the 21 genes across gender and age groups was assessed using Student’s t-test and one-way Welch ANOVA respectively, with false discovery rate (FDR) adjustment of the p-values (Reiner et al. 2003) to account for the number of genes analysed (see 04-data-analysis.R). Post-hoc Games-Howell tests were then conducted for significant ANOVA results. Assumptions for these tests were checked with diagnostic tests and plots, and the overall distribution of gene expression against age and gender was examined visually (see 03-data-exploration.R). Given the large sample (\(n = 775\)) and high data completeness, the tests were treated as robust to departures from normality. Second, the association of therapy response completeness with gender and with age group was tested using chi-squared tests, with expected frequencies checked to be \(> 5\) (see 05-response-analysis.R). Finally, logistic regression models were fitted to predict response completeness and to test how age or gender groupings affect predictions. Null models (response ~ 1) were fitted as the baseline, followed by gene-expression-only models (response ~ gene expression), main-effect models (response ~ gene expression + age/gender) and interaction models (response ~ gene expression × age/gender). The effects of age group and gender were investigated independently. The fitted models were compared with the likelihood-ratio test (LRT), and the resulting p-values were FDR-corrected. Significant models were then checked for goodness-of-fit with chi-squared tests and Akaike Information Criterion (AIC) scores, and for influential observations and collinearity (Hodgson et al. 2025); all of this is recorded in 06-effect-analysis.R.
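To make the FDR adjustment concrete, the block below sketches the Benjamini-Hochberg step-up adjustment in standalone Python (the study's own scripts are in R). That the study used this particular FDR procedure is an assumption; the text only cites Reiner et al. (2003). The function name and the example p-values are hypothetical and are not results from this study.

```python
def benjamini_hochberg(p_values):
    """Return Benjamini-Hochberg adjusted p-values in the original order."""
    m = len(p_values)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down so adjusted values stay monotone.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, p_values[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


# Hypothetical raw p-values, for illustration only.
raw = [0.001, 0.008, 0.039, 0.041, 0.60]
for p, q in zip(raw, benjamini_hochberg(raw)):
    print(f"raw p = {p:.3f} -> adjusted p = {q:.5f}")
```

With 21 genes tested, the smallest raw p-value is multiplied by at most 21 under this procedure, which is why a gene can be nominally significant yet lose significance after adjustment.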

Results

Gene expression levels showed no significant difference between gender groups after t-tests with FDR correction. Across age groups, 10 genes showed significant differences after Welch ANOVA with FDR correction: BCL2 (p = 0.0002), BLNK (p = 0.003), MYBL1 (p = 0.0096), SH3BP5 (p = 0.0156), ITPKB (p = 0.0203), CCND2 (p = 0.0203), PTPN1 (p = 0.0314), BMF (p = 0.0339), FUT8 (p = 0.0367) and LMO2 (p = 0.0377). Post-hoc Games-Howell tests separated the age groups into 2 statistically distinct groupings for most genes, and into 3 for BLNK. These 10 genes, their expression levels across the 4 age groups, and the similarity between age groups are shown in Figure 1.

Figure 1. Gene Expression Patterns Across Age Groups.

Fig. 1
Box plots show expression distribution by age group for genes with significant ANOVA results (FDR-adjusted p < 0.05).
Compact letter display (CLD) letters (a, b, c) above each boxplot indicate statistical groupings from Games-Howell post-hoc tests; groups with different letters differ significantly (p < 0.05).
Sample sizes (n=X) are shown below each age group.
Individual data points are overlaid, with outliers highlighted by darker borders.

Therapy response did not differ significantly across gender or age groups. The distribution of complete response was almost uniform across genders (p > 0.99). The distribution across age groups showed some variation, with the proportion of incomplete response rising with age, but this was not statistically significant (p = 0.173). To examine the relation between gene expression levels and treatment response, logistic regression analysis with gender or age group interaction was performed for the 21 genes. After fitting null, gene effect, main effect and interaction effect models for all 21 genes, 2 genes of interest emerged with statistical significance. First, MYC showed a significant gene effect (gene model LRT FDR-adjusted p-value = 0.009954), with an acceptable goodness-of-fit (chi-square p-value = 0.9467) and the best (lowest) AIC score (647.57) of the 4 MYC models, 10.22 lower than the null model. Second, CCND2 showed a significant age interaction effect (age interaction model LRT FDR-adjusted p-value = 0.005523), also with a good goodness-of-fit (chi-square p-value = 0.9726) and again the best (lowest) AIC score (643.12) of the 4 CCND2 models, more than 12 points lower than every other model. No other gene showed a significant gene, main or interaction effect for either age or gender. The odds ratios of the gene effects of all 21 genes are shown in Figure 2.
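The AIC gap and the LRT reported for MYC are linked. Since AIC = 2k - 2 ln L, for nested models fitted to the same observations the LRT statistic equals the AIC reduction plus twice the number of added parameters. The standalone Python sketch below (the study's own analysis is in R) works through this for MYC. It assumes the gene model adds exactly one parameter to the null model and that the LRT in question compares these two models; the function names are hypothetical.

```python
import math


def lrt_from_aic(aic_reduced, aic_full, extra_params):
    """LRT statistic implied by the AIC scores of two nested models."""
    return (aic_reduced - aic_full) + 2 * extra_params


def chi2_sf_df1(x):
    """Upper-tail probability of a chi-squared variable with 1 degree of freedom."""
    return math.erfc(math.sqrt(x / 2))


aic_gene = 647.57            # reported AIC of the MYC gene model
aic_null = aic_gene + 10.22  # null model AIC implied by the reported gap

# Assumption: the gene model adds a single slope to the intercept-only model.
stat = lrt_from_aic(aic_null, aic_gene, extra_params=1)
raw_p = chi2_sf_df1(stat)

# Under a Benjamini-Hochberg adjustment over 21 genes (an assumption), the
# smallest raw p-value is multiplied by at most 21.
adjusted_upper_bound = min(1.0, raw_p * 21)

print(f"LRT statistic = {stat:.2f} on 1 df")
print(f"raw p = {raw_p:.6f}, adjusted p <= {adjusted_upper_bound:.6f}")
```

Because the reported AIC values are rounded, the result is only approximate, but it can be compared with the FDR-adjusted p-value of 0.009954 reported for the MYC gene model.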

Looking further at the logistic regression plot of the CCND2 age-interaction model (Figure 3), it becomes clear that higher CCND2 expression affects response completeness differently across age groups. The curves for age groups ≤40 and 41-60 run in the opposite direction to those for age groups 61-75 and >75: complete response probability decreases with CCND2 expression in the 2 younger groups but increases with CCND2 expression in the 2 older groups. This contrasts sharply with the logistic regression plot for the MYC age-interaction model, where the curves of all 4 age groups run in the same direction, with complete response probability decreasing as MYC expression increases. In the odds ratios of the 4 age groups for the CCND2 age-interaction model (Figure 4), both older age groups (61-75, >75) showed significantly higher odds of complete response, whereas the younger age groups (≤40, 41-60) trended towards lower odds of complete response without reaching significance.
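The mechanism behind curves that run in opposite directions can be sketched with a toy interaction model. In a model of the form response ~ gene expression × age group, each age group gets its own intercept offset and its own slope offset, so the effective slope can change sign between groups. The standalone Python block below (the study's own models are fitted in R) uses made-up coefficients chosen only to illustrate the sign flip; they are not estimates from this study, and the function name is hypothetical.

```python
import math

# Hypothetical coefficients on the log-odds scale, for illustration only.
BASE_INTERCEPT = 3.0   # reference group (<=40)
BASE_SLOPE = -0.3      # slope of expression in the reference group

# Per-group (intercept offset, slope offset) relative to the reference group.
GROUP_TERMS = {
    "<=40": (0.0, 0.0),
    "41-60": (-0.2, 0.1),
    "61-75": (-2.0, 0.5),
    ">75": (-2.5, 0.6),
}


def group_probability(group, expression):
    """Probability of complete response for one age group at a given expression."""
    d_intercept, d_slope = GROUP_TERMS[group]
    log_odds = (BASE_INTERCEPT + d_intercept) + (BASE_SLOPE + d_slope) * expression
    return 1.0 / (1.0 + math.exp(-log_odds))


for group, (_, d_slope) in GROUP_TERMS.items():
    low, high = group_probability(group, 2.0), group_probability(group, 6.0)
    print(f"{group:>5}: slope = {BASE_SLOPE + d_slope:+.1f}, "
          f"P(complete) at expression 2 = {low:.2f}, at 6 = {high:.2f}")
```

Exponentiating a group's effective slope gives its odds ratio per unit increase in log2 expression, so a value above 1 corresponds to rising curves and a value below 1 to falling ones.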

Figure 2. Logistic Regression Models - Gene Expression Effects on Complete Response

Fig. 2
Orange: Adjusted p-values < 0.05
Teal: Adjusted p-values ≥ 0.05, unadjusted p-values < 0.05
Light Purple: Unadjusted p-values ≥ 0.05 (insignificant)
Confidence interval: 95%
MYC is the only gene showing a significant gene effect after FDR correction; even allowing for its confidence interval error bars, it lies well clear of the 1.0 reference (no effect) line.

Figure 3. Logistic Regression Models - Gene Expression Effects on Complete Response

Fig. 3
A. Logistic regression plot of MYC gene expression against complete response probability, age-interaction model
  • MYC showed no significant effect in the interaction model; it is used here as a reference for comparison with CCND2

B. Logistic regression plot of CCND2 gene expression against complete response probability, age-interaction model

Discussion

Conclusion

References

Carella, Angelo M., Carmino A. de Souza, Stefano Luminari, et al. 2013. “Prognostic Role of Gender in Diffuse Large B-Cell Lymphoma Treated with Rituximab Containing Regimens: A Fondazione Italiana Linfomi/Grupo de Estudos Em Moléstias Onco-Hematológicas Retrospective Study.” Leukemia & Lymphoma 54 (1): 53–57. https://doi.org/10.3109/10428194.2012.691482.
Chenevix-Trench, Georgia, Frederick G. Behm, and Eric H. Westin. 1986. “Somatic Rearrangement of the c-Myc Oncogene in Primary Human Diffuse Large-Cell Lymphoma.” International Journal of Cancer 38 (4): 513–16. https://doi.org/10.1002/ijc.2910380410.
Fisher, Richard I., Ellen R. Gaynor, Steve Dahlberg, et al. 1993. “Comparison of a Standard Regimen (CHOP) with Three Intensive Chemotherapy Regimens for Advanced Non-Hodgkin’s Lymphoma.” New England Journal of Medicine 328 (14): 1002–6. https://doi.org/10.1056/NEJM199304083281404.
Gascoyne, Randy D., Sheryle A. Adomat, Stanislaw Krajewski, et al. 1997. “Prognostic Significance of Bcl-2 Protein Expression and Bcl-2 Gene Rearrangement in Diffuse Aggressive Non-Hodgkin’s Lymphoma.” Blood 90 (1): 244–51. https://doi.org/10.1182/blood.V90.1.244.
Habermann, Thomas M. 2014. “Is Rituximab One for All Ages and Each Sex?” Blood 123 (5): 602–3. https://doi.org/10.1182/blood-2013-12-543314.
Hodgson, Vicki, Matt Castle, Rob Nicholls, and Martin van Rongen. 2025. Generalised Linear Models. https://cambiotraining.github.io/stats-glm.
Johnson, Nathalie A., Graham W. Slack, Kerry J. Savage, et al. 2012. “Concurrent Expression of MYC and BCL2 in Diffuse Large B-Cell Lymphoma Treated with Rituximab Plus Cyclophosphamide, Doxorubicin, Vincristine, and Prednisone.” Journal of Clinical Oncology 30 (28): 3452–59. https://doi.org/10.1200/JCO.2011.41.0985.
Klapper, Wolfram, Markus Kreuz, Christian W. Kohler, et al. 2012. “Patient Age at Diagnosis Is Associated with the Molecular Characteristics of Diffuse Large B-Cell Lymphoma.” Blood 119 (8): 1882–87. https://doi.org/10.1182/blood-2011-10-388470.
Kurz, Katrin S., Sophia Steinlein, Markus Kreuz, et al. 2025. “Age- and Gender-Specific Molecular Characteristics of Diffuse Large B-Cell Lymphoma: Results from Clinical Trials of the DSHNHL/GLA.” HemaSphere 9 (3): e70093. https://doi.org/10.1002/hem3.70093.
Lo Coco, Francesco, Bihui H. Ye, Florigio Lista, et al. 1994. “Rearrangements of the BCL6 Gene in Diffuse Large Cell Non-Hodgkin’s Lymphoma.” Blood 83 (7): 1757–59. https://doi.org/10.1182/blood.V83.7.1757.1757.
Lossos, Izidore S., Debra K. Czerwinski, Ash A. Alizadeh, et al. 2004. “Prediction of Survival in Diffuse Large-B-Cell Lymphoma Based on the Expression of Six Genes.” New England Journal of Medicine 350 (18): 1828–37. https://doi.org/10.1056/NEJMoa032520.
Petrich, Adam M., Chadi Nabhan, and Sonali M. Smith. 2014. “MYC-Associated and Double-Hit Lymphomas: A Review of Pathobiology, Prognosis, and Therapeutic Approaches.” Cancer 120 (24): 3884–95. https://doi.org/10.1002/cncr.28899.
R Core Team. 2025. R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing. https://www.R-project.org/.
Reddy, Anupama, Jenny Zhang, Nicholas S. Davis, et al. 2017a. “Genetic and Functional Drivers of Diffuse Large B Cell Lymphoma.” Cell 171 (2): 481–494.e15. https://doi.org/10.1016/j.cell.2017.09.027.
Reddy, Anupama, Jenny Zhang, Nicholas S. Davis, et al. 2017b. Supplementary Data for “Genetic and Functional Drivers of Diffuse Large B Cell Lymphoma”: Table S1 (Clinical Information and Genetic Alteration Data for 1,001 DLBCL Samples) and Table S2 (ABC/GCB Classification Using Gene Expression Data). Supplementary Materials, Cell. https://doi.org/10.1016/j.cell.2017.09.027.
Reiner, Anat, Daniel Yekutieli, and Yoav Benjamini. 2003. “Identifying Differentially Expressed Genes Using False Discovery Rate Controlling Procedures.” Bioinformatics 19 (3): 368–75. https://doi.org/10.1093/bioinformatics/btf877.
Shipp, Margaret A, David P Harrington, James R Anderson, et al. 1993. “A Predictive Model for Aggressive Non-Hodgkin’s Lymphoma.” New England Journal of Medicine 329 (14): 987–94. https://doi.org/10.1056/NEJM199309303291402.
The Non-Hodgkin’s Lymphoma Classification Project. 1997. “A Clinical Evaluation of the International Lymphoma Study Group Classification of Non-Hodgkin’s Lymphoma.” Blood 89 (11): 3909–18. https://doi.org/10.1182/blood.V89.11.3909.
Wang, Sophia S. 2023. “Epidemiology and Etiology of Diffuse Large B-Cell Lymphoma.” Seminars in Hematology 60 (5): 255–66. https://doi.org/10.1053/j.seminhematol.2023.11.004.
Wickham, Hadley, and Jennifer Bryan. 2025. Readxl: Read Excel Files. https://doi.org/10.32614/CRAN.package.readxl.
Wright, George, Bruce Tan, Andreas Rosenwald, Elaine H. Hurt, Adrian Wiestner, and Louis M. Staudt. 2003. “A Gene Expression-Based Method to Diagnose Clinically Distinct Subgroups of Diffuse Large B Cell Lymphoma.” Proceedings of the National Academy of Sciences 100 (17): 9991–96. https://doi.org/10.1073/pnas.1732008100.
Zhou, Zheng, Laurie H. Sehn, Alfred W. Rademaker, et al. 2014. “An Enhanced International Prognostic Index (NCCN-IPI) for Patients with Diffuse Large B-Cell Lymphoma Treated in the Rituximab Era.” Blood 123 (6): 837–42. https://doi.org/10.1182/blood-2013-09-524108.